-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: File row splits #1709
feat: File row splits #1709
Conversation
@@ -26,11 +26,11 @@ pub(crate) trait ReadMasked { | |||
|
|||
/// Read an array with a [`RowMask`]. | |||
pub(crate) struct ReadArray { | |||
layout: Box<dyn LayoutReader>, | |||
layout: Arc<dyn LayoutReader>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So LayoutReaders are no longer stateful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like I overlooked some layers of of state/locks in the readers, that will take some thinking and probably additional refactoring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok after a more careful reading and a small test, as far as I can tell while the LayoutReader
implementations are stateful (with interior mutability), they can read independently from each other as the LayoutBufferedReader
is the part that actually controls the overall flow and each stream initializes a new one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There can be multiple concurrent reads being tracked in the layout reader but are they reusable?
I thought something encoded assumptions that splits were accessed monotonically but maybe that was in the BufferedReader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought so too, but that does seem to be in the BufferedLayoutReader
. The LayoutReader
implementation read through poll_read
which takes a RowMask
to read try and read a specific range on each call, and the BufferedReader
tracks that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to access the layoutreaders out of order so we can read multiple at the same time irrespective of their children.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a step in the right direction. I still think we should pull apart BufferedReader thingy.
And are you sure all the state machines have gone from layout readers?
|
||
// Set up a stream of RowMask that result from applying a filter expression over the file. | ||
let mask_iterator = if let Some(fr) = &self.filter_reader { | ||
Box::new(BufferedLayoutReader::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we pull the splits_stream out of this thing and just stream::iter(self.splits().iter()).map(|s| self.read_range(s).boxed()).buffered(64)
or whatever?
Or is that not possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly worried about using buffered
everywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Buffered is a bit too naive. You want more control over coalescing
layout_reader: Arc<dyn LayoutReader>, | ||
filter_reader: Option<Arc<dyn LayoutReader>>, | ||
messages_cache: Arc<RwLock<LayoutMessageCache>>, | ||
row_mask: Option<RowMask>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it necessary to provide a handle-level row-mask? This sort of feels like a parameter you pass when you build a scan from the handle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The flow I imagine is something like:
- Read some external index to create initial
RowMask
- Pass that row to the the builder to create a handle.
- Reuse the initial mask to create concurrent reads on subranges of the file
Introduces a new
VortexReadHandle
that can be used cloned and used to either generate a full stream over the underlying file OR read a specific row range.The handle also implements
Clone
so it can be re-used over the same file.DataFusion support will be in a followup PR.